This notebook is designed to give you all the information you need to get started analyzing citation networks--and networks of just about anything else--using Python, NetworkX, NumPy, and Matplotlib.
We'll discuss the following points to help you get up and running quickly with the examples.
I'm Thom Neale, a Boston-based software developer for the nonprofit, nonpartisan Sunlight Foundation in Washington, DC. Most of the work I do is web development, scraper code, and some back-end work. I'm a recovering attorney, so I experiment with legal information in Python whenever possible. This talk grew out of research I did for the Canadian Legal Information Institute (CanLII) to help them decide how to expand the coverage of their databases of Canadian court decisions.
This notebook and the sample data it requires live on my github page at https://github.com/twneale/citation-network-analysis. To install locally, do the following:
$ virtualenv pydata2013
$ source ./pydata2013/bin/activate
$ git clone git@github.com:twneale/citation-network-analysis.git
$ cd citation-network-analysis
# Note: if the following step fails, you're missing dependencies. If resolving them is not your cup of tea,
# I recommend installing the free Anaconda distribution of Python from Continuum Analytics: https://store.continuum.io/
$ pip install -r requirements.txt
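Once that finishes, a quick optional sanity check is to confirm the core imports work before moving on:

$ python -c "import networkx, numpy, scipy, matplotlib"

If that command exits silently, you're good to go.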
Know this jargon, or keep this list handy to refer to, and you're ready to dive into network analysis:
In [83]:
import networkx as nx
# Here's how you create a simple directed graph in NetworkX.
G = nx.DiGraph()
# A node can be any hashable object.
G.add_node('twneale')
G.add_node('daveneale')
# Edges are simply tuples of nodes. You can associate arbitrary data with an edge.
G.add_edge('twneale', 'daveneale', relationship="family")
# NetworkX has a plethora of useful methods for building networks.
G.add_edges_from([
('twneale', 'jamesturk'),
('twneale', 'paultag'),
('paultag', 'jamesturk')
])
# NetworkX includes a "draw" function that plots the graph with matplotlib.
nx.draw(G)
We can also use NetworkX to identify cliques in the graph. But note that the clique problem is NP-complete, so if the clique detection algorithm in NetworkX isn't suitable for your use case, you'll find yourself in a dilly of a pickle.
In [84]:
undirected = G.to_undirected()
for clique in nx.find_cliques(undirected):
    print clique
# Note that it does a pretty good job on our small sample graph, though,
# even if clique-finding is NP-complete in general.
In [88]:
# Networks can be serialized to, and loaded from, JSON and a number of other formats.
# Here we're loading a citation graph of Supreme Court of Canada cases from scc.json.
from networkx.readwrite import json_graph
with open('scc.json') as f:
    G = json_graph.load(f)
from itertools import islice
# Now we can look at some basic characteristics of the network.
print 'Number of nodes:', G.number_of_nodes()
print 'Number of edges:', G.number_of_edges()
# Let's view some of the nodes.
print 'Here are a few nodes'
for node in islice(G.nodes_iter(), 5):
    print ' -', node
# Let's view some of the edges. Note that they're just 2-tuples of nodes.
print 'Here are a few edges'
for edge in islice(G.edges_iter(), 5):
    print ' -', edge
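Going the other direction is just as easy: convert the graph to a plain dict with json_graph.node_link_data and write it out with the standard library's json module. A minimal sketch (the output filename is just a placeholder, and this assumes the node-link flavor of the JSON format):

import json
from networkx.readwrite import json_graph
# Convert the graph to a node-link dict and serialize it with the stdlib json module.
data = json_graph.node_link_data(G)
with open('scc-copy.json', 'w') as f:
    json.dump(data, f)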
Now we have a larger network to explore. Good science is good observation, so let's make whatever observations we can about this example graph of case citations. As you saw above, simply printing a few nodes and edges is profoundly useless and tells us nothing about the data. Network data is challenging in that, even for small networks, there's really no easy way to understand your data without plotting its characteristics and looking at some charts. Someday, when you're having a panic attack trying to figure out how to make sense of gigabytes of graph data, remember: charts.
In [111]:
# Creating a histogram is a one-liner with matplotlib.
from matplotlib import pyplot as plt
values = [1, 2, 3, 2, 1, 2, 3, 4, 4, 3, 5, 3]
plt.hist(values, max(values))
plt.show()
In [118]:
# The histogram below shows the distribution of publication years of citing cases.
# So we retrieved the publication year of the citing case for each network edge,
# then plotted the distribution of those values.
from matplotlib import pyplot as plt
import numpy
years = tuple(G.get_edge_data(*edge)['year'] for edge in G.edges_iter())
plt.hist(years, 14)
plt.xlabel('Years')
plt.ylabel('Number of Cases')
plt.show()
Plotting the degree distribution of a network is an important step in getting familiar with your data. If the degree distribution follows a power law, the network is "scale-free", meaning a large number of edges are distributed among a small number of nodes, and a small number of edges are distributed among a large number of nodes. In short, a few nodes are hogging all the connections.
Why bother with this? If the network is scale-free, we may be able to apply network analysis techniques that are known to work well on other scale-free networks, like the Internet.
In [120]:
degrees = G.degree().values()
h, bins = numpy.histogram(degrees, bins=100)
x_series = bins.compress(h)
y_series = h.compress(h)
plt.plot(x_series, y_series)
plt.xlabel('Number of citations (degree)')
plt.ylabel('Number of cases')
plt.show()
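A quick way to eyeball whether the distribution really follows a power law is to redraw it on log-log axes, where a power law shows up as a roughly straight line. Here's a rough sketch reusing x_series and y_series from the cell above (the marker styling is my own choice, not part of the original analysis):

# Re-plot the binned degree distribution on log-log axes.
# A power-law (scale-free) distribution should look roughly linear here.
plt.loglog(x_series, y_series, marker='o', linestyle='none')
plt.xlabel('Number of citations (degree)')
plt.ylabel('Number of cases')
plt.show()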
Here's how to find the highest-ranked case using PageRank and print the cases that cite it:
In [123]:
from operator import itemgetter
# As you can see, calculating PageRank values for our graph isn't too tricky.
ranks = nx.pagerank(G)
# Let's retrieve the highest ranked case.
highest = sorted(ranks, key=ranks.get).pop()
first = itemgetter(0)
print 'The highest ranked case was %s (%r)' % (highest, ranks[highest])
print 'It was cited by these cases:'
print sorted(map(first, G.in_edges(highest)))
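If you want more than just the single top case, heapq.nlargest will pull the top few straight out of the ranks dict. A small aside, not part of the original analysis:

import heapq
# Print the five highest-PageRank cases along with their scores.
for case in heapq.nlargest(5, ranks, key=ranks.get):
    print ' -', case, ranks[case]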
Another important step in verifying that your graph has the characteristics you expect is to test your data for a correlation with an external data source. Below we do a basic linear regression between the number of pageviews each case received on CanLII's website and the number of inbound citations to each case (obtained via the G.in_degree() call):
In [157]:
import json
from scipy import stats
# Now we load page-view counts for the top 1000 or so Supreme Court of Canada cases.
with open('pageviews.json') as f:
    pageviews = json.load(f)
ranks = G.in_degree()
ids = set(pageviews) & set(ranks)
x = map(pageviews.get, ids)
y = map(ranks.get, ids)
res = stats.linregress(x, y)
slope, intercept, r_value, p_value, std_err = res
print ' -', 'r_value', r_value
print ' -', 'p_value', p_value
print ' -', 'standard error', std_err
We can see a weak correlation, which results from using a small subset of the full network as an example. When I ran this same regression using the in-degree centrality scores of the full network of 1.2 million citation edges versus the top 1000 most popular cases from each jurisdiction, the correlation was much stronger (over 0.40), as we might expect it to be. Another sign that nothing is horribly wrong with our graph! Keep those coming.
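Numbers alone can hide a lot, so it's also worth scattering the two series against each other and overlaying the fitted line. A minimal sketch reusing x, y, slope, and intercept from the regression cell above (the styling choices are mine):

# Scatter page views against inbound citations and overlay the fitted line.
plt.scatter(x, y, alpha=0.5)
fit_x = numpy.array(sorted(x))
plt.plot(fit_x, intercept + slope * fit_x, 'r-')
plt.xlabel('Page views on CanLII')
plt.ylabel('Inbound citations')
plt.show()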
In [161]:
from collections import defaultdict
# To see how each case's citation count evolves over time, compute a separate
# in-degree count for each year by pruning away edges from all other years.
def prune(G, year):
    # Remove every edge whose 'year' attribute is present and differs from `year`.
    get_edge_data = G.get_edge_data
    for edge in G.edges():
        _year = (get_edge_data(*edge) or {}).get('year')
        if _year and _year != year:
            G.remove_edge(*edge)

ranks = defaultdict(list)
for year in range(2000, 2014):
    # Work on a copy so the full graph survives for the next iteration.
    H = G.copy()
    prune(H, year)
    for _id, rank in H.in_degree().items():
        ranks[_id].append(rank)
In [165]:
# Now we have an in-degree count for each case for each year:
import pprint
years = range(2000, 2014)
pprint.pprint(zip(years, ranks[highest]))
In [178]:
from itertools import dropwhile
def polyfit(x, y, degree):
    # Work with arrays so the arithmetic below broadcasts correctly.
    x, y = numpy.asarray(x), numpy.asarray(y)
    results = {}
    coeffs = numpy.polyfit(x, y, degree)
    # Polynomial coefficients
    results['polynomial'] = coeffs.tolist()
    # r-squared
    p = numpy.poly1d(coeffs)
    # Fit values, and mean
    yhat = p(x)  # or [p(z) for z in x]
    ybar = numpy.sum(y) / len(y)  # or sum(y)/len(y)
    ssreg = numpy.sum((yhat - ybar)**2)  # or sum([(yihat - ybar)**2 for yihat in yhat])
    sstot = numpy.sum((y - ybar)**2)  # or sum([(yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot
    results['fit_fn'] = p
    return results
def plot(case, degree=1):
    vals = ranks[case]
    vals = zip(range(2000, 2014), vals)
    # Skip leading years in which the case had no citations yet.
    vals = list(dropwhile(lambda (year, deg): not deg, vals))
    x_series, y_series = zip(*vals)
    fig = plt.figure()
    # Plot the scatter.
    ax = fig.add_subplot(111)
    ax.plot(x_series, y_series, 'o')
    # Plot the regression line.
    result = polyfit(x_series, y_series, degree)
    ax.plot(x_series, result['fit_fn'](x_series),
            label='Degree %d, r2=%r' % (degree, result['determination']))
    #ax.fill_between(x_series, result['fit_fn'](x_series), facecolor='y', alpha=0.3)
    plt.title('Citation: %s' % case)
    ymax = max(y_series)
    plt.legend()
    plt.ylim(ymin=0, ymax=ymax + ymax * 0.1)
    plt.xlim(xmin=min(x_series), xmax=max(x_series))
    plt.show()
plot(highest, degree=1)